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SPARSE REGRESSION WITH HIGHLY CORRELATED PREDICTORS 


BEHROOZ GHORBANI, OZGUR YILMAZ 


Abstract. We consider a linear regression y = Xfi + u where X £ p ^ n, and 0 is s— sparse. 

Motivated by examples in financial and economic data, we consider the situation where X has highly 
correlated and clustered columns. To perform sparse recovery in this setting, we introduce the clustering 
removal algorithm (GRA), that seeks to decrease the correlation in X by removing the cluster structure 
without changing the parameter vector 0. We show that as long as certain assumptions hold about X, 
the decorrelated matrix will satisfy the restricted isometry property (RIP) with high probability. We also 
provide examples of the empirical performance of GRA and compare it with other sparse recovery techniques. 


1. Introduction 

We consider the sparse estimation (or sparse reeovery) problem given by: 

(1) Estimate j3 from y = Xf] + u 

where A is a known nxp design matrix, u is additive noise such that ||m ||2 < 0 for some known ry, and has at 
most s non-zero entries with s n p. Lasso [T^] and basis pursuit denoise (BPDN) [5] are popular sparse 
recovery approaches proposed for ([^, which are based on solving certain convex optimization problems. On 
the other hand, orthogonal matching pursuit (OMP) [14], CoSaMP [12], Iterative Hard Thresholding |3], and 
Hard Thresholding Pursuit m are examples of greedy algorithms that can be used to solve 0 . Although 
these algorithms have different capabilities in estimating /3, the success of each is highly dependent on the 
structure of the design matrix X and the sparsity level s in relation to p and n. It is well known that when 

XiX* 

the columns of X have high empirical correlation, i.e., for some i yf j, pij := v- ■ ■■ -- has magnitude 

\\Xi\\2\\Xj\\2 

close to one, the above mentioned sparse recovery methods do not perform well jj. 

In many applications, such as in model selection, X is already given in terms of observations of different 
variables of the data. Often, the variables constituting the columns of X are highly correlated. Therefore, 
at least for a significant number of columns i,j, \pij\ tends to be close to 1, and accordingly, standard sparse 
recovery methods, such as the ones we mention above, fail to yield good estimates for /3. 

In this note, we will focus on this issue, i.e., sparse estimation when the design matrix has correlated 
columns. Motivated by various applications where X is empirically generated, we will further assume that 
the columns of X can be organized in a number of clusters such that the columns that are in the same 
cluster are highly correlated but the columns that are in different clusters may or may not be correlated. 
For example, let Xf i be the price of stock i at time t. Due to shared underlying economic factors, we expect 
the price vector of, for example, technology stocks to be tightly clustered together. On the other hand, 
stocks of telecommunication companies may form a different cluster. Due to global economic factors, such 
as the monetary policy or the overall economic growth, the stock prices of telecommunication companies 
and technology companies may also be correlated, but we expect lower levels of correlation between them 
compared to the correlation inside the clusters. Another example where such cluster structure can be 
observed is in face recognition - see m for details. 

Our approach can be summarized as follows: To overcome the challenge posed by high correlation in the 
design matrix, we propose to modify X in order to get a matrix that is suitable for sparse recovery without 
changing the original parameter vector /3. We seek to do this by first identifying the clusters (say, q of them) 
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and then constructing a representative vector for each cluster. Let R G be the matrix constructed by 

putting together the representative vectors of the q clusters. We project X to the orthogonal complement 
of the range of R and normalize the columns of the resulting matrix, which we denote by X. Our proposed 
algorithm, we will prove, is effective when this matrix X is suitable as a compressive sensing measurement 
matrix. 

After introducing our notation and the necessary background in Sectionj^ we state the proposed clustering 
removal algorithm (CRA) and our main assumptions in Section In Section we prove that if A is a 
realization of a random matrix X whose columns are uniformly distributed on q disjoint spherical caps, 
X provides a suitable matrix for sparse recovery with high probability. In Section we demonstrate the 
performance of our algorithm on highly correlated financial data, in comparison with BPDN and with SWAP, 
an algorithm proposed by HZ] for sparse recovery in highly correlated settings. 

2. Notations and Background 

In what follows, we refer to random matrices and random vectors with bold letters. Let n, r S N and let 
A G We denote the Ah column of A by Ai. For T C [r] := {!,...,r}. At denotes the submatrix 

of A consisting of its columns indexed by T. We define 11^ to be the orthogonal projection operator into 
TZ{A), the range of A, and 11^^ denotes the orthogonal projection operator into the orthogonal complement 
of TZ{A). The best fc-term approximation of 6 S M" is G K" such that = bi if bi is among the k 
largest-in-magnitude entries of b. Otherwise, b] ‘ = 0. The corresponding best fc-term approximation error 
in £p is 

For 0 C 5'”“^ := {x S M" : ||x ||2 = 1}, we define the wedge W{£1) by 

W(fl) := {rx : X S O, r G [0,1]}. 

We denote the n-dimensional Lebesgue measure with A”, and the uniform spherical measure on 5'”“^ with 
which can be defined via 

:= A"(W(0))/A”(W(5"-i)) 

provided that VF(0) is a measurable subset of K" (otherwise O is not measurable). It should be noted that 
although different definitions of uniform spherical measure exist in the literature, all these measures must 
be equal, for example, in the setting that we have described above [7]. 

2.1. Compressive sampling. It is well known in the compressive sampling literature that BPDN provides 
a stable and robust solution for the sparse estimation problem 0 in a computationally tractable way , e.g., 
[Hi]. Specifically, BPDN provides the estimate 

(2) r,) — argnun||z||i s.t \\Xz-y\\ 2 <ri 

where q is an upper bound on ||u|| 2 . One can guarantee that ,0Bpj3ivi(y x r/) ^ good estimate of if /3 is 

sufficiently sparse or compressible (i.e., it can be well approximated by a sparse vector) and the “restricted 
isometry constants” of X are sufficiently small |5], where the restricted isometry constants are defined as 
follows. 

Definition 1. Let i5s(A) be the smallest positive constant such that VT C [p], \T\ = s, Vz G 

(l-5,(A))||z||2<||Apz||^<(l + 5,(A))||z||2. 

In this case, we say that X satisfies the restricted isometry property (RIP) of order s with (restricted 
isometry) constant Ss- 

The following theorem by jS] is a sharper version of the original “stable and robust recovery theorem” of 


Theorem 2. Let dts{X) < 


^ for some t > -. Then, for any s-sparse fi S 


11 ^: 


# 




^BPDN(y. X. 7,) A-11^- 

where /3BPDN(y, x, ri) is as in ([^, and C, D > 0 are constants depending only on 5ts- 


Even though the above theorem provides guarantees for BPDN to recover (or estimate) a sparse or 
compressible vector /3 from its (possibly noisy) compressive samples y, these guarantees depend on X having 
small restricted isometry constants. Constructing matrices with (nearly) optimally small restricted isometry 
constants is an extremely challenging task and still an open problem. Furthermore, computing these constants 
entails a combinatorial computational complexity and is not tractable as the size of the matrix increases. 
As a remedy, the literature has focused on random matrices. Indeed, one can prove that certain classes of 
sub-Gaussian random matrices (see next section) have nearly optimally small restricted isometry constants 
with overwhelming probability. Accordingly, realizations of such random matrices are used in the context of 
compressed sensing. We will adopt such an approach. 


2.2. Sub-Gaussian random matrices. Our theoretical analysis in Section [^relies on certain fundamental 
properties of sub-Gaussian random matrices. Here, we state some basic definitions that we will need. See 
|18| for a thorough exposition on the non-asymptotic theory of sub-Gaussian random matrices. In what 
follows, we stick to the notation of |18| . 

Definition 3. A random variable x is said to be a sub-Gaussian random variable if it satisfies 

for all g > 1. Its sub-Gaussian norm ||x ||^2 is defined as 

|lx||^2 = supg“^/^(E|x|'J)^^‘^. 

9>1 

Next we define the following two important classes of random vectors. 

Definition 4. Let x be a random vector in M". 

(i) X is said to be isotropic if E(x, = ||rt|p for all u G K". 

(ii) X is sub-Gaussian if the marginals (x, m) are sub-Gaussian random variables for all u G K". The 
sub-Gaussian norm of the random vector x is defined by 

||x||^, := sup \\{x,u)\\^^. 


3. Assumptions and the Algorithm 

We now go back to the sparse estimation problem Q, i.e., we want to estimate /3 from y = Xp + u 
when given the matrix X and the fact that ||u|j 2 ? 7 . For the purpose of this paper, we assume that ||Aii ||2 = 1 
Vf G [p]. This simply can be done by normalizing the columns and rescaling /3. Below, Cz denotes a spherical 
cap, i.e., a portion of 5'"“^ cut off by some hyperplane P, with “centroid” z, which is the point on the cap 
that has maximum distance from P. 

We make the following additional assumptions. 

Assumption 1. The data matrix A is a realization of a random matrix X. 

Assumption 2. There exist zi,...,Zq G S^~^, q n, such that Vf € [p], 3j G [g] s.t Xi is uniformly 
distributed on Cz^- That is, Xi is distributed according to the measure mc^., where for all measurable 
subsets A of Cz , 

mc./A) := A"(W(A))/A”(W(C.,)). 

In this case, we write X^ ~ Cz^- 

Assumption 3. For i j, Xi and Xj are independent. 
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Algorithm 1 Clustering Removal Algorithm (CRA) 

Step 1:: Estimate Gi, G 2 ,Gg. 

Step 2:: Estimate R := [zi|z 2 |---|-Z( 3 ] by setting Zi = ^ Xj. 

' j&Gi 

Step 3:: Use a sparse recovery method for obtaining an estimate 7 for 7 from 
(3) y = Xj + u 

where y = Ui^±y, X = u = 

Step 4:: The estimated /3 will be /3 := 


Clearly, by Assumption]^ Xi can be clustered into q groups Gi,... ,Gq where 

G, = {tG [p] s.t. X, ^ G,J 

Under these assumptions, we propose the clustering removal algorithm (CRA), described in Algorithm]^ 
for estimating /3 in Q. In the first step of CRA, we estimate Gj, j G [g] by clustering the columns of X using 
an appropriate clustering algorithm. For the numerical simulations in this paper, we use fc-means clustering. 
Note that here we make the additional assumptions that we know q and it is possible to estimate Gj, which, 
for example, requires that the spherical caps Gz^ are well separated. In step 2, we estimate the centroid Zj 
of each cluster Gj by averaging the columns of X that belong to Gj . Note that this is an unbiased estimate 
for Zj assuming the clusters have been accurately identified. In step 3, we project each column of X onto 
the orthogonal complement of the g-dimensional subspace spanned by the cluster centroids (without loss of 
generality, we assume that the set of cluster centroids {zi,..., z^} is linearly independent). We normalize the 
projected columns by multiplying the matrix with a diagonal matrix where (An^^ jc)i,i = | A^j I 2 . 

After normalization, we obtain our modified matrix A := Ilj^i AAjj"^^ which, as we will show in the next 
section, is an appropriate measurement matrix for sparse recovery. The main idea here is that the columns 
of A lie on the (n — g)-dimensional subspace and, in fact, we show below that A is a realization of 

X := II/jiXAjj^^^x where X^ are independent and uniformly distributed on 5'”“^ n7^(i?)"*~. Accordingly, A 
turns out to be a good compressive sampling matrix, as explained in the next section, and we use a sparse 
recovery algorithm of our choice to obtain 7 , an estimate of 7 , from ([^. In the final step, we “unnormalize” 
by multiplying 7 with the diagonal matrix 

4. Theoretical Recovery Guarantees for CRA 

In this section, we assume that the clusters Gj and their centroids Zj, j = 1,..., g are known (or accurately 
estimated). Under this assumption, we show that X, with high probability, is an appropriate compressive 
sampling matrix. The following two theorems from |18| will be instrumental in our theoretical analysis of 
CRA. 

Theorem 5. |18l Theorem 5.65] Let Z be an nx p random matrix with n < p and ||Zi ||2 = 1 for all i G [p]. 
Let k G [p] and 5 G (0,1). Suppose that 

(i) the eolumns of Ti are independent, and 

(ii) the eolumns of y/nZ are suh-Gaussian and isotropic. 

Then, there exists positive constants c and G, depending only on the maximum suh-Gaussian norm K := 
maxi ||Xi ||.02 of the columns ofZ, such that 

(4) n>GS-^k\og{^) 6k{Z)<d 

with probability at least 1 — 2 exp(—c5^n). 

Theorem 6. |T5] Let z be an n X 1 random vector. Lf z is uniformly distributed on y/nS'^ then z is both 
sub-Gaussian and isotropic. Furthermore, the suh-Gaussian norm of z is bounded by a universal constant. 
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Next, we will prove that the columns of satisfy the hypotheses of Theorem To that end, for z S K" 
define 

fo„xi ifn^iZ = 0, 

^ ^ 1 iin otherwise. 

Note that = a;(Xi). 

Theorem 7. If Assumption^ and Assumption^hold, uj(X.i) is uniformly distributed on 5'"“^ mZ{R)'^ for 
each is [n]. Moreover, for i fy j, a;(Xi) and a;(Xj) are independent. 

Proof. Since X^ and Xj are independent, and w : K” i—)■ K" is measurable, a;(Xi) and a;(Xj) are also 
independent. Next we show that a;(Xi) is uniformly distributed on S'^~^ mZ{R)-^. To that end, let X^ S Chi 
and let P be the hyperplane that generates Cfh ■ Then, Ri is a normal vector of P and accordingly we have 

P = {xGM” : {x-q,Ri) =0} 

where q is an arbitrary point on P that we fix. Recall, on the other hand, that 

(5) Cn,={xGS^-^ ■.{x,R,)>{q,Ri)} 

where g S P is as above. Note that (g. Pi) is independent of the choice of g. 

Now, let n C n TZ{R)-^. Noting that ||Xi ||2 = 1, we observe that 

(6) P(a;(Xi) e f]) =P(nij^Xi e fT(f])\{0}) 

(7) = P(nij^Xi G Vb(f7)) - P(nij^Xi = 0) 

since 0 G W{Vl). We know that any vector z G M" can be uniquely written as zr + where zr G TZ{R) 
and Zji± G TZ{R)^. Note that n^^iz = Aj^^zr + Ar±Zri. = Zr±. Therefore, Ar±z G W(f2) if and only if 
Zr± G tT(fl) which is holds if and only if z is an element of 

n“^(f2) := {ru + Rv : r G [0,1], u G fl, u G M'^}. 

Hence, keeping in mind that X^ G Cr^ , we have 

p(nij^Xi G w{n)) = P(Xi G n-i(H) n Cr,) 

and 

mRrX; = 0) = P(Xi GP(P) n Cr,). 

Furthermore, since the distribution of Xi is continuous, and TZ(R) has a lower dimension compared to Cr.,^, 
P(Xi GCr^ n P(P)) = 0. Accordingly, it follows from Q that 

(8) P(u;(Xi) G H) = P(Xi G n-i(H)nCflJ. 


Next, recall that X^ is uniformly distributed on Cr,^, i.e., X^ is distributed according to the measure 
defined as in Assumption!^ Therefore, we can rewrite ([^ as 


( 9 ) 


P(a;(Xi) G n) 



n-i(n)nCfli 



w(n-i(n)nCHi) 


dX^jy) 



n-i(n)nw(Ciii) 


dX^jy) 

XHWiCR,)) 


n-i(n)) 
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where lvy(CHj indicator function of the set W{Cr-^), i.e., 


lw(Ca^){y) ■■= 


1 ifyeH^(CflJ 

0 otherwise 


Above, the second equality follows from the definition of the measure and the third equality holds 

because VF(n“^(fl) n Cr^) = 11“^(fl) n W{Cr^)^ which is easy to verify. 

Next, we decompose 11“^ (fl) = 14^(0) x TZ{R) and denote by and A"“'^ the Lebesgue measure on 
TZ{R) and on TZ{R)^ respectively. Note that bF(fl) and TZ{R), and thus every y G n”^(n) can be uniquely 
decomposed as y = yR± + yR such that yR± G W{^) and yR G TZ{R). Then, we obtain 

P(a;(Xi) G n) 


WiQ) 


n{R) 


+ Ur) 


dX'^iyR) 


dX^-’^iVR.) 


( 10 ) 


= / / 




’■W(Ch, )('■“ + »«) 

/ ^"(iv(Ch )) d.^'^ivn) 

r! [ 0 , 1 ] [■«(«) ' ' Hi" 

Above, first equality follows from Fubini-Tonelli’s theorem, and the second equality is obtained by passing to 
polar coordinates and using Fubini-Tonelli’s theorem one more time. Finally, is the unique spherical 

measure on 5”"“^ n TZ{R)^, i.e., the unit sphere of the subspace TZ{R)^. 

Next, we observe that + yR) = 1 if and only if one of the following two statements hold: 

(i) ru + yR = 0, which, since u is orthogonal to yR, is equivalent to having -|- Hj/hIH = 0 , or 

(ii) 0 < -h \\yR\\l < 1 and 

ru + v ^ 

e Cr, 


\ru -I- v \\2 


which, by (§, holds if and only if 

{yR,Ri) > {r^ + \\yR\\l) 

where g is as in ([^. 


'{q,Ri) 


Accordingly, we conclude that the innermost integral in (101 does not depend on u, which implies that 


( 11 ) 


[ 0 . 1 ] 


TZ{R) 


xhw{CrJ) 


•dX'^iyR.) 


-1-^dr = 


c, 


where c is a constant that only depends on n. Substituting ( |11| into ( |10[ ), we obtain 

P(a;(Xi) G fl) =ca”-«-i(fl). 

Therefore, we conclude that a;(Xi) is uniformly distributed on n TZ{R)^. 


□ 


Next, we will show that X, with high probability, is an appropriate compressive sampling matrix, and 
thus we can use the proposed clustering removal algorithm (CRA) to obtain an accurate estimate of /3 in 
Q. To that end, let U'^ : K" i—)■ K" be a unitary transformation such that t/^(7?.(i?)'*-) = spanjei,..., e„_q} 
where e, are the standard basis vectors. Then we can write 


t7"X = 


qxp 


where A is a (n — g) x p random matrix. Due to the rotation invariance of the Lebesgue measure, and since 
[/ is a deterministic matrix, the columns of C/’^X are independently uniformly distributed over 

n 7^(i?)^) = {:r G : x^-q+i = ... = = 0}. 

This means that the columns of A are uniformly distributed over 5'"“'!“^. Theoremj^and Theorem [^provide 
sufficient conditions for A to satisfy RIP with overwhelming probability. More precisely, the following holds. 

Corollary 8 . If {n — q) > C6~'^klog{^), then with probability at least 1 — 2exp(—C(5^(n — g)), 6k{A) < 6. 















Next, noting that X = t7 


9XP. 


, we relate the RIP constants of X with the RIP constants of A. 

^ A ^ 


Proposition 9. If A satisfies the restricted isometry property with constant Sk, U 
RIP with the exact same constant. 


gxp. 


will also satisfy 


Proof. First, recall that a matrix Z G satisfies RIP of order k with constant 6k if and only if for all 

I C [p] with |J| = k, the eigenvalues of Zf Zj are all between 1 — Sk and 1 + 6k. Here Zj denotes the 
submatrix of Z consisting of its columns indexed by I. 

Suppose that A satisfies RIP of order k with constant Sk. Set 

X = U 

and let / be any subset of [p] such that |J| = k. Then, 

Ai 




XfXi = 


Ai 

Ogx fc 


= AjAi. 


qxk 

n T 

U^U 


(u y (u ) 

V Uoxfe / V Oqxk / 


Ai 

^qxk 


Therefore, for all I C [p], eigenvalues of XjXj and AjAj are equal. Hence, X satisfies RIP of order k 
with constant Sk if A satisfies RIP of order k with the same constant. □ 

Proposition [^suggests that if the assumptions of Corollary are met, X satisfies RIP with overwhelming 
probability. This, in turn, provides uniform estimation guarantees for 7 , as defined in ([^. In particular, 
combining Corollaryand Proposition]^ we arrive at the following conclusion. 

Theorem 10. Consider Prohlem^and suppose that the data matrix X satisfies Assumptions^!^ 

(i) If {n — q) > C{t — l)slog(|^) for some t > 4/3, then Sts{X) < ^krA yjith probability greater than 

1 - exp ( - c^(n - q)). 

(a) Consequently, for any s-sparse 7 € 

where D,C>0 are constants depending only on Sts. Here 7 is as in (|^. 

(Hi) The estimate in (ii) implies the following bound for the estimate of j3: 

2C(Ti(7), 


' - p \\2 < max ■ 




■{Dr] 


Remark 1. Note that part (i) of Theorem 
In particular, we can set t = 2, which yields 


10 


is a restatement of Corollary 


me sufficient condition 

,ep^ 


with S = \l and k = ts. 


{n-q)> C'slog(—). 

Part (ii) of Theorem 10 follows from Theorem together with part (i). Part (iii) can be justified using the 
definition of 7 as well as Step 4 of Algorithm 

Remark 2. Part (iii) of Theorem 10 shows that in the estimation of fi the error term is inflated and behaves 
like max(l/| AilI 2 ). However, it should be noted that no matter what method is used, sparse recovery in 
highly correlated settings is inherently unstable when the noise level is high. For example, let Ai and Xj be 
unit-norm and highly correlated, and let Tp be the indices of the true support of fi. Assume i G T/i but j ^ Tp. 
Then, without any assumptions on the structure of the noise u, as soon as ||u ||2 exceeds ||Ai — Ajj| 2 |/?i| it 
becomes impossible to distinguish whether i G Tp or j G Tp by just considering y. If Ai and Xj are highly 
correlated, then |jAi — XjW^ tends to be small and unless fit is extremely large, ||Ai — AjH 2 |^i| tends to be 
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small too. Hence, nearly accurate sparse recovery is possible only for modest amounts of ||u|| 2 - As we will 
see in section the numerical results corroborate this result. 

Remark 3. Our analysis in this section has been based on the (unrealistic) assumption that TZ{R)^ can 
be perfectly estimated. Let E be an imperfect estimate of TZ{R)^. By some algebra it can be shown that 
if E is reasonably accurate, one can bound the matrix norm of the difference of X and X := TIeXN^^^. 
From there on we refer the reader to the analysis of m which suggests that X will still satisfy the restricted 
isometry property but with different constants, which depend on the restricted isometry constants of X and 
the accuracy of E. We also note that the numerical experiments we present in the next section use design 
matrices for which clusters or TZ{R)^ are not known a priori and can only be imperfectly estimated. Yet, 
the empirical results illustrate that CRA is still effective in these examples. 


5. Empirical Performance 


For the purpose of testing the empirical performance of our algorithm, we provide two sets of experiments. 
In the first one, we generate X synthetically from a factor model. In the second experiment, we use correlated 
stock price time-series as the columns of X. In each experiment, we run multiple trials in which we randomly 
choose a 20—sparse /3 such that every non-zero entry is uniformly distributed in [1,2]. Using this chosen 
/3, we generate a y vector and test how well our algorithms can estimate /3 given y and X. We report the 
average performance of each algorithm across all trials. 

Following the suggestion of [2], beside CRA, basis pursuit, and SWAP, we add two new estimators: CRA- 
OLS, and BPDN-OLS. In CRA-OLS and BPDN-OLS, we use the supports of and Pbpdn as the 

support estimate of /?. Let Th be this estimated support. We estimate the non-zero entries of /3 by Xl y. 

Tf) 


When generating y from /3, we choose the noise vector u according to aN(0nxij Inxn) distribution, where 
a is chosen such that the SNR is equal to a dB with a G {10,15, 20,..., 100}. For the initialization step of 
SWAP and the sparse recovery step of CRA, we use BPDN, with ||u ||2 as its parameter. The solver that we 
use for basis pursuit is SPGLl nang. For each noise level, we generate 30 realizations of /3 and as our first 


performance measure, we report the average over these trials of 


II/3-/3<^°>||2 


across different noise levels. 


for all five algorithms. 

To empirically test the support recovery performance of CRA, we next define the true positive rate of (3 
to be 


TPR(ffl := l“PP(-9) n “PP(/?I 
|supp(/3)| 


In support recovery literature, the goal is to have a sparse j3 such that TPR(/3) is as close to one as possible. 
As our second measure of performance, we report the average TPR over 30 trials (as described above) of 
CRA, BPDN, and SWAP across different noise levels. In noisy settings, results of basis pursuit and CRA 
have a large number of non-zero entries. Therefore, although Pcra and Pbpdn contain a large proportion 
of the true support, they are not informative for identifying it. To have a reasonable measure of performance 
for CRA and BPDN, we report TPR(/3^^^^) and TPR(^^^°^jy). 


5.1. Synthetic Data. In our first experiment, we aim to test the performance of our algorithm on a 
synthetically generated data. Inspired by factor models, which are widely used in econometrics and finance 
(we refer to [T] for a survey on factor models and their applications), we generate the data matrix, X, 
according to 

X = EA + Z 

where F G A G Z G R"^^, and q <^n. The columns of F are the underlying factors that drive 

the cross-correlation among the columns of X. A is the coefficient matrix which determines the extent of 
the exposure of each column of X to the Fi’s. Z is the idiosyncratic variation in the data, which in this 
experiment is generated according to an i.i.d Gaussian distribution. We generate the Fi according to an 
ARMA(2,0) model as follows: 

Ft,i = 0.5Ft_iy -|- 0.3Ft-2,i + Vt, Vt ^ 1) 
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Figure 1. The SVD comparison of X, X, and a uniformly distributed matrix in the “syn¬ 
thetic data” experiment of Section |5.1[ 


This formulation ensures that entries of X are reasonably correlated both across rows and columns and 
therefore, it gives the data a rich correlation structure. We choose n = 250, p = 1000, and q = 25. 

Even though the data is not explicitly decomposed of clusters distributed uniformly on spherical caps, we 
can still apply CRA to this problem. In this case, the columns of F will play the role of cluster centers. Due 
to the factor structure of the data, and since we only need to estimate the Tl{F) to construct X, we use the 
range of the most significant q eigenvectors of XX' for estimating Figure [^demonstrates the singular 


values of X on the left hand side, and singular values of X on the right. For comparison purposes, we have 
also plotted the singular values of a 250 x 1000 matrix with columns uniformly distributed on As we 
can see, the singular values of X closely resemble those of the matrix with uniformly distributed columns. 


Figure 


shows the average value of 


||/ 3 -/ 3 { 20 }||, 

mh 


for all of our algorithms, across various noise levels. As 


Figure [^demonstrates, for SNR levels more than 30 dB, CRA-OLS and CRA have the best performance. For 
SNR levels less than 30 dB, the estimation quality deteriorates significantly to the extent that no algorithm 
provides a reasonable estimate of /3. 

Figure [^compares the true positive rate of the algorithms. As the figure suggests, CRA recovers most of 
the true support up until 30 dB. Afterwards, the estimation quality falls significantly for all of the algorithms. 

The average running time of the algorithms, all on the same machine, are presented in Table[2 Empirically, 
we observe that SPGLl solves the CRA problem, which does not have high correlation in it, much faster. 
Moreover both CRA and BPDN are considerably faster than SWAP. 


Algorithm 

Running Times (s) 

CRA (without clustering) 

0.044 

clustering with eigenvectors 

0.023 

BPDN 

0.730 

SWAP 

4.100 


Table 1. 
level for 19 


Average running times of various algorithms over 570 trials (30 trials per noise 


different values of a) for the example presented in Section 5.1 


5.2. Pseudo-Real Data. 


1 


We can also use k-means clustering for this purpose and the estimation results are very similar. 
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Figure 2. The average estimation error across different noise levels in the “synthetic data” 
experiment of Section [5.1| Each data point corresponds to the average estimation error over 
30 realizations. 



Figure 3. The true positive rate across different noise levels in the “synthetic data” exper¬ 
iment of Section 5.1 Each data point corresponds to the average true positive rate over 30 
realizations. 


5.2.1. The Data Description. For the purpose of the numerical simulations with real data, we consider 
daily stock prices. Daily stock prices present us with a real life data set with a complex high correlation 
environment. We use the prices of 2000 NASDAQ stocks recorded from June 15 2012 to June 15 201^0 
After removing the holidays, non-trading days, and special trading session, each stock has 500 observations. 
To give some structure to the data, we remove the trends from each time series, and also rescale the I 2 norm 
of each series to -y/n. After this stage, the data is used to populate the matrix X G ]^500x2000^ FigureHis the 

1 1 

graphical representation of —X*X. It is evident that, even after all modification on the data, —X*X is very 

n n 

different from identity, and hence X does not fit into the category of suitable matrices for sparse recovery. 


^The data is downloaded from Yahoo! Finance 


10 














1 



Figure 4. The matrix -X*X. 
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Figure 5. The covariance matrix of 


5.2.2. Results. By using k-means clustering, we identify 14 clusters in the data, hence R € k500x 14^ Figurej^ 
is the graphical representation of ^ . As we can see, we have a tremendous 

improvement in the covariance structure of the data matrix. 

Figure]^ shows the singular values of X on the left hand side, and singular values of X on the right. For 
comparison we include in the left figure the plot of the singular values of a 500 x 2000 matrix with columns 
uniformly distributed on Although there is considerable difference between the singular values of X 

and those of the matrix with uniformly distributed columns, we observe a vast improvement compared to 
the original case. This improvement is in fact sufficient for us to successfully perform sparse estimation. One 
can choose more than 14 clusters or use a more advanced clustering method to make the singular values of 
X closer to the uniform case. However, we empirically observed that this does not improve the estimation 
results significantly. 
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Figure 6 . The SVD comparison of X, X, and a uniformly distributed matrix in the 
“pseudo-real data” experiment of Section |5.2[ 




Figure demonstrates the average value of 




for all five algorithms. As Figure 


shows, in 


low noise settings, CRA-OLS and CRA have the best performance among the algorithms. As the noise level 
increases, distinguishing between the columns of X becomes almost impossible. For noise levels more than 
20 dB, none of the algorithms provide a reasonable estimate of /?. 



Figure 7. The average estimation error across different noise levels in the “pseudo-real 
data” experiment of Section |5.2[ Each data point corresponds to the average estimation 
error over 30 realizations. 

Figure shows the average value of TPR(/3’t^°^) across different noise levels, for each algorithm. It can 
be observed that again CRA outperforms both SWAP and BPDN. 

In SWAP, the sparsity rate of the final vector is always equal to the size of the initial support provided, 
I To I- To make sure that SWAP’s estimate has a high TPR ( although in the expense of adding false positive 
indices), [T7] suggest providing a larger Tq. Indeed, our results suggest that if we allow SWAP to search for 
a 30-sparse vector in estimating a 20-sparse /3, TPR(/3 sveap) increases. However, as the size of the support 
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Figure 8 . The true positive rate across different noise levels in the “pseudo-real data” 
experiment of Section |5.2[ Each data point corresponds to the average true positive rate 
over 30 realizations. 

that SWAP is working with increases, the running time of SWAP increases significantly. The following table 
represent the average running time of the algorithms on the same machin^ 


Algorithm 

Running Times (s) 

BPDN 

0.90 

K-means Clustering 

2.48 

CRA (with Clustering) 

3.70 

SWAP (20-sparse) 

18.57 

SWAP (30-sparse) 

146.27 


Table 2. Average running times of various algorithms over 570 trials (30 trials per noise 


level for 19 different values of a) for the example presented in Section 5.2 


A second point to consider is the sensitivity of the algorithm’s performance to the number of observations. 
We run the same experiment as above but we discard the first 250 observations so X G ]^250x2000^ Figure 
1^ shows the difference between the average TPR with 500 observations and the average TPR with 250 
observations for each algorithm. In comparison to other algorithms, CRA’s performance falls much less 
when the number of observations decreases while SWAP’s performance degrades significantly. 


6. Conclusions and Future Work 


CRA is a computationally efficient estimation method that allows for sparse recovery in highly corre¬ 
lated environments. As we showed in Section]^ even with extremely simple clustering algorithms, CRA’s 
empirical performance is superior to other state-of-the-art sparse recovery algorithms. Moreover, CRA is 
computationally efficient (as reflected in the run times given in Tables and which makes it suitable for 
working with large data sets. In addition, the algorithm provides a large degree of flexibility in the choices 
of the sparse recovery technique and clustering method. Accordingly, CRA is a versatile sparse estimation 
method for cases where the design matrix (or equivalently, the compressive sensing measurement matrix) 
has correlated columns that can be grouped into a small number of clusters as we described before. 

Despite all these strengths, there are certain setting where one may need to modify CRA. First, as we 
noted earlier, in noisy environments, as the data becomes more and more highly correlated, distinguishing 


'^The experiments presented in Sections 


5.1 


and 


5.2 


times presented in Tableland Tablecannot be compared directly. 


are conducted on two different machines respectively, so the running 
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Figure 9. The difference between the performances of each algorithm when the number of 
observations is reduced from 500 to 250. 

the columns from each other becomes impossible. In this case, to decrease the correlation in the matrix, 
one might want to group the columns that are extremely correlated with each other and then run CRA 
on the representative vectors of the grouped variables. There are many approaches for variable grouping, 
e.g., g and [5] introduce two new automated grouping algorithms and provide a short survey of the field. 
Combining CRA with these variable grouping algorithms is an interesting subject for future research. 

Another interesting problem is whether one can relax the rigid cluster structure imposed by our assump¬ 
tions. This would mean that we adapt CRA for sparse recovery in the case of design matrices with correlated 
columns that do not satisfy our Assumptions [T][^ 
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